fix: centralize UTF-8 file I/O for Windows compatibility#45

Merged
ar7casper merged 11 commits into knostic:release/2026-05-10 from joshbouncesecurity:fix/issue16-09-utf8-io
May 10, 2026

Conversation

@joshbouncesecurity
Contributor

Summary

Bare open() calls use the system encoding (cp1252 on Windows), causing "charmap codec can't decode" errors on any target codebase containing non-ASCII characters (e.g. curly quotes U+2019, accented characters, CJK).

Adds utilities/file_io.py with centralized UTF-8 helpers (open_utf8, read_json, write_json, run_utf8) and migrates ~190 bare open() call sites across libs/openant-core/ (core, parsers, utilities, openant CLI, top-level scripts) to specify encoding="utf-8" explicitly. Also sets encoding/errors on the docker_executor subprocess.run that captures container stdout/stderr as text.

A regression test scans non-test Python files for any bare open() call without an encoding= argument and fails if a regression reappears.

Addresses item 9 from #16 (does not close the issue).
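The helper surface described above can be sketched roughly as follows. This is a minimal sketch based only on the names in this PR description (open_utf8, read_json, write_json, run_utf8); the real utilities/file_io.py may differ in signatures and details:

```python
"""Hedged sketch of the utilities/file_io.py helpers described in this PR."""
import json
import subprocess


def open_utf8(path, mode="r", **kwargs):
    # Force UTF-8 regardless of the platform locale (cp1252 on Windows).
    return open(path, mode, encoding="utf-8", **kwargs)


def read_json(path):
    with open_utf8(path) as f:
        return json.load(f)


def write_json(path, data, **kwargs):
    with open_utf8(path, "w") as f:
        json.dump(data, f, **kwargs)


def run_utf8(*args, **kwargs):
    # Only inject encoding/errors when the caller asked for text mode
    # (text=True or its universal_newlines alias) and did not already
    # specify an encoding of their own.
    wants_text = kwargs.get("text") or kwargs.get("universal_newlines")
    if wants_text and "encoding" not in kwargs:
        kwargs["encoding"] = "utf-8"
        kwargs["errors"] = "replace"
    return subprocess.run(*args, **kwargs)
```

The run_utf8 guard mirrors the behavior stated in the test plan: explicit encoding= from the caller is respected, and binary-mode subprocess calls are left untouched.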

Test plan

  • Helper unit tests: non-ASCII round-trip via open_utf8, read_json, write_json.
  • run_utf8 captures non-ASCII subprocess output (incl. universal_newlines=True alias).
  • run_utf8 only injects encoding/errors when caller asks for text mode; explicit encoding= is respected.
  • Regression: no bare open( in non-test code (test scans libs/openant-core/ excluding tests/ and the helper).
  • Existing tests pass: pytest tests/ -> 96 passed, 22 skipped (env-dependent).
  • Manual: run a parse against a repo with non-ASCII content on a cp1252 console -- no charmap errors.
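The bare-open( regression scan from the test plan can be approximated with an AST walk like the one below. This is a hedged sketch: the actual test's file discovery, exclusion rules, and edge-case handling may differ, but the core check (an open() call with no encoding= keyword) is the same idea:

```python
"""Sketch of a regression scan for bare open() calls without encoding=."""
import ast


def bare_open_calls(source: str, filename: str):
    """Return (filename, lineno) pairs for open() calls lacking encoding=."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "open"
            and not any(kw.arg == "encoding" for kw in node.keywords)
        ):
            # Binary-mode opens ("rb"/"wb") legitimately omit encoding=.
            mode = node.args[1] if len(node.args) > 1 else None
            if isinstance(mode, ast.Constant) and "b" in str(mode.value):
                continue
            hits.append((filename, node.lineno))
    return hits
```

A test built on this helper would walk the non-test files under libs/openant-core/ and assert the returned list is empty.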

joshbouncesecurity and others added 2 commits May 4, 2026 21:24
Bare open() calls use the system encoding (cp1252 on Windows), causing
'charmap codec can't decode byte ...' errors when parsing repositories
containing non-ASCII characters such as curly quotes.

Adds utilities/file_io.py with open_utf8, read_json, write_json, and
run_utf8 helpers, and migrates ~190 bare open() call sites across
libs/openant-core/ (core, parsers, utilities, openant CLI, top-level
scripts) to specify encoding="utf-8" explicitly. Also sets
encoding/errors on the docker_executor subprocess.run that captures
container stdout/stderr as text.

Includes a regression test that scans non-test code for any bare open()
call without an encoding= argument and fails if a regression reappears.

Addresses item 9 from #16.
… UTF-8

Round 1 review fixes for PR knostic#45:

- application_context.py, ast_parser.py, dataset_enhancer.py, report/__main__.py,
  report/generator.py: pass encoding='utf-8' on every Path.read_text() /
  write_text() call. The previous migration only covered open() calls; pathlib's
  text helpers also default to the system locale on Windows (cp1252) and crash
  on non-ASCII source code.
- parsers/{c,go,javascript,php,ruby}/test_pipeline.py: pass
  encoding='utf-8', errors='replace' on subprocess.run(text=True) invocations
  of parser binaries and CodeQL. Only docker_executor.py was migrated before;
  these other call sites had the same Windows cp1252 hazard.
- tests/test_file_io.py: extend regression scan with two new asserts —
  Path.read_text/write_text without encoding=, and subprocess.run(text=True)
  without encoding=. Refactored the call-walking logic into a shared helper.

All 14 file_io tests pass; full tests/ suite: 98 passed, 22 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joshbouncesecurity
Contributor Author

Manual verification

  • Windows cp1252 console: scan a target repo containing non-ASCII content (e.g., Japanese strings, accented chars in comments). Confirm no UnicodeDecodeError / charmap codec can't decode errors.
  • Linux/macOS: same scan succeeds (no regression).
  • Subprocess paths: docker_executor capturing non-ASCII output: no decode crash.
  • Path-method I/O: any Path.read_text() / write_text() calls covered by the round-2 follow-up still work cross-platform.
  • git grep 'open(' -- 'libs/openant-core/**/*.py' (excluding tests + the helper module; git grep takes pathspecs, not grep's --include): no bare open( calls remain. The new regression test enforces this in CI.
  • JSON round-trip: utilities/file_io.py write_json then read_json of a dict with {"name": "héllo 日本語"} returns the same dict (not escaped sequences).

@joshbouncesecurity
Contributor Author

Local test results

Verified the migration on Windows by static-grepping the worktree and round-tripping non-ASCII content through utilities.file_io.

Commands run:

# 1. No bare open() in non-test code
git grep -nE "open\([^)]*\)" -- 'libs/openant-core/**/*.py' \
    ':!libs/openant-core/tests/**' ':!libs/openant-core/utilities/file_io.py' \
    | grep -v "encoding=" | grep -v "open_utf8"
# Two hits, both legitimate:
#   parsers/zig/function_extractor.py:43:  with open(full_path, "rb") as f:   <- binary mode, no encoding needed
#   utilities/agentic_enhancer/prompts.py:68:  - Files: open() with external paths   <- text inside a docstring/prompt

# 2. Round-trip via utilities.file_io
python check45.py

Output:

OK default round-trip preserves non-ASCII via \uXXXX escapes
OK ensure_ascii=False writes raw UTF-8 bytes; read_json still round-trips
OK open_utf8 round-trips curly quote on this Windows shell

Outcome:

  • No bare open() in non-test code (only legitimate "rb" and a docstring fragment) ✅
  • read_json / write_json round-trip a dict with {"name": "héllo 日本語", "list": [..., "’"]} byte-identically ✅
  • write_json(..., ensure_ascii=False) writes raw UTF-8 bytes (verified e6 97 a5 e6 9c ac e8 aa 9e for the JP run); read_json still parses it back to the same dict ✅
  • open_utf8 round-trips a U+2019 curly quote on this Windows shell (default cp1252 console) ✅
  • Did not run a full openant scan against a non-ASCII codebase end-to-end — would have required an LLM call. The above confirms the helper works on Windows; the migration's call-site coverage is enforced by the new regression test.
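The round-trip behaviour verified above can be reproduced with plain json plus explicit encoding, standing in for the PR's write_json/read_json wrappers (which are assumed to be thin wrappers over these calls):

```python
"""Self-contained reproduction of the JSON round-trip checks above."""
import json
from pathlib import Path

data = {"name": "héllo 日本語"}
path = Path("roundtrip.json")

# Default json.dump escapes non-ASCII as \uXXXX but still round-trips.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f)
with open(path, encoding="utf-8") as f:
    assert json.load(f) == data

# ensure_ascii=False writes raw UTF-8 bytes instead of \uXXXX escapes.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
raw = path.read_bytes()
assert "日本語".encode("utf-8") in raw  # e6 97 a5 e6 9c ac e8 aa 9e
with open(path, encoding="utf-8") as f:
    assert json.load(f) == data

path.unlink()
```

The byte sequence in the comment matches the one reported in the outcome list above for the Japanese string.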

@joshbouncesecurity joshbouncesecurity marked this pull request as ready for review May 7, 2026 11:22
Collaborator

@ar7casper ar7casper left a comment


One thing I'd like your call on before merging: utilities/file_io.py defines open_utf8 /
read_json / write_json / run_utf8, but the migration inlines encoding="utf-8" at
every call site rather than routing through these helpers. So as it stands, the helper
module is only imported by its own test file — production code never touches it.

Three ways to go:

  1. Adopt the helpers in this PR — single chokepoint to change later (e.g. if we ever
    want to switch to errors="strict" or add logging). Bigger churn though, and it's literally
    one keyword argument we'd be hiding.
  2. Keep the helpers as the recommended pattern for new code — current state. Fine, but
    they'll quietly rot unless we tell contributors to reach for them. Worth a one-liner in
    libs/openant-core/CLAUDE.md if you go this way.
  3. Drop the helpers and rely on inlined encoding="utf-8" + the regression scanners as
    the contract. Less surface area, one less thing to import.

joshbouncesecurity and others added 2 commits May 10, 2026 13:26
The previous commit inlined encoding="utf-8" at every call site, but
production code already used the file_io helpers (read_json, write_json,
open_utf8) on master. This restores that pattern so the helpers are
actually exercised and remain the single chokepoint for I/O policy.

All four regression scanner tests (bare open, pathlib text I/O, dot
open, text-mode subprocess) continue to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng failure

Bare open() without encoding= uses cp1252 on Windows, which would crash
on non-ASCII dataset content. These calls had read_json on master.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joshbouncesecurity joshbouncesecurity marked this pull request as draft May 10, 2026 10:35
joshbouncesecurity and others added 7 commits May 10, 2026 13:38
Three files that used helpers on master still had inline encoding after
the previous fix pass: application_context.py (save_context write),
parser_adapter.py (diff_filter report write), and
agentic_enhancer/repository_index.py (load_index_from_file read).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace inline open/json.load and open/json.dump patterns with
read_json/write_json from utilities/file_io.py in the ten files
added by this branch that were still using bare I/O:

- core/diff_filter.py
- core/schemas.py (StepReport.write)
- openant/cli.py (all json.load calls in cmd_*)
- report/__main__.py (cmd_summary, cmd_disclosures)
- report/generator.py (merge_dynamic_results, generate_all)
- utilities/context_enhancer.py (checkpoint read/write paths)
- parsers/zig/{call_graph_builder,function_extractor,repository_scanner,unit_generator}.py

All 15 regression-scanner tests pass after this change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace filepath.read_text + json.loads with read_json in
check_for_manual_override, keeping read_text only for the
.md parsing branch which needs the raw text for regex.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- reporter.py: open_utf8 for markdown report writes
- docker_executor.py: open_utf8 for Dockerfile/compose/script writes
- parsers/*/test_pipeline.py: read_json/write_json for all JSON
  reads and writes; open_utf8 for text file writes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- subprocess.run(..., encoding="utf-8", errors="replace") →
  run_utf8(...) in all test_pipeline.py files and docker_executor.py
- Path.write_text(encoding="utf-8") → open_utf8 + f.write() in
  report/__main__.py and report/generator.py
- Path.read_text(encoding="utf-8") → open_utf8 + f.read() in
  report/generator.py (load_prompt)

Remaining read_text(errors="ignore/replace") calls in parsers and
application_context.py stay inline — open_utf8 does not support the
errors= override, and these reads are intentionally lenient on
non-UTF-8 source files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
open_utf8 passes **kwargs through to open(), so errors="ignore" and
errors="replace" work correctly. Replace all remaining Path.read_text
inline uses in application_context.py, ast_parser.py, and
dataset_enhancer.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor Author

@joshbouncesecurity joshbouncesecurity left a comment


Went with option 1. All production code (and the test pipelines) now routes through the helpers — read_json, write_json, open_utf8, and run_utf8 — rather than inlining encoding="utf-8" at every call site. The four regression scanner tests in test_file_io.py enforce this going forward and will catch any backslide on bare open(), Path.read_text/write_text, or subprocess.run(text=True) without the helpers.

The only remaining encoding="utf-8" references in production code are calls that also pass errors="ignore" or errors="replace" to read arbitrary source files — open_utf8 passes **kwargs through to open(), so those use open_utf8(path, errors="replace") etc. rather than read_text.
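The **kwargs passthrough described above can be illustrated as follows. open_utf8 here is a hypothetical thin wrapper matching the behavior the PR describes, so errors="replace" reaches the underlying open() call and invalid bytes decode to U+FFFD instead of raising:

```python
"""Sketch of open_utf8's errors= passthrough for lenient source reads."""
from pathlib import Path


def open_utf8(path, mode="r", **kwargs):
    # Hypothetical wrapper; forwards errors= (and any other kwargs) to open().
    return open(path, mode, encoding="utf-8", **kwargs)


# A file containing a byte that is invalid UTF-8 (raw cp1252 0xE9, "é").
p = Path("lenient.txt")
p.write_bytes(b"caf\xe9")

with open_utf8(p, errors="replace") as f:
    text = f.read()

# The invalid byte becomes U+FFFD rather than raising UnicodeDecodeError.
assert text == "caf\ufffd"
p.unlink()
```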

@joshbouncesecurity joshbouncesecurity marked this pull request as ready for review May 10, 2026 11:22
@joshbouncesecurity
Contributor Author

@ar7casper I think this is ok now

joshbouncesecurity added a commit to joshbouncesecurity/OpenAnt that referenced this pull request May 10, 2026
Adds test_no_bare_path_calls_in_typescript_analyzer, a cross-platform scanner
that greps typescript_analyzer.js for any path.relative/resolve/join() call not
accompanied by toPosixPath() within a ±6-line window. This mirrors the PR knostic#45
pattern (test_no_bare_open, test_no_bare_pathlib_text_io, etc.) that prevents
contributors from reintroducing encoding antipatterns.

Scoped to typescript_analyzer.js only — other JS parser files legitimately call
path.X() without toPosixPath() since they do not interact with ts-morph.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joshbouncesecurity joshbouncesecurity changed the base branch from master to release/2026-05-10 May 10, 2026 12:37
@ar7casper ar7casper merged commit 8bad38d into knostic:release/2026-05-10 May 10, 2026
8 checks passed
@joshbouncesecurity joshbouncesecurity deleted the fix/issue16-09-utf8-io branch May 10, 2026 12:48